This tutorial demonstrates training a reinforcement learning agent with federated learning in the CartPole environment. Before running this program you will need to install OpenAI Gym.
To train the agent we use a policy based on a simple neural network that maps the CartPole environment's state space to its action space. The policy is trained with federated learning using the PySyft library: the program simulates the policy training happening on a remote machine, represented by the virtual worker Bob.
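Both OpenAI Gym and PySyft are typically installed with pip; the exact versions depend on your environment, and this tutorial assumes the older PySyft API that provides TorchHook and VirtualWorker:
pip install gym
pip install syft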
In [1]:
import torch
from torch import nn, optim
import torch.nn.functional as F
from torch.distributions import Categorical
import gym
import numpy as np
import syft as sy
In [2]:
env = gym.make('CartPole-v0')
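As an optional check (not part of the original code), the environment's spaces explain the layer sizes chosen for the policy network below: CartPole observations have 4 features and there are 2 discrete actions.
print(env.observation_space)   # 4 features: cart position, cart velocity, pole angle, pole angular velocity
print(env.action_space)        # Discrete(2): push the cart to the left or to the right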
In [3]:
hook = sy.TorchHook(torch)
bob = sy.VirtualWorker(hook, id="bob")
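As a minimal sketch of how the virtual worker is used throughout this tutorial (the tensor values here are purely illustrative): sending a tensor to bob replaces it locally with a pointer, operations on pointers execute on bob, and .get() brings the result back.
x = torch.tensor([1.0, 2.0, 3.0]).send(bob)   # x is now a pointer to a tensor stored on bob
y = (x + x).get()                             # the addition runs on bob; get() retrieves the result
print(y)                                      # tensor([2., 4., 6.])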
In [4]:
class Policy(nn.Module):
    def __init__(self):
        super(Policy, self).__init__()
        self.input = nn.Linear(4, 4)
        self.output = nn.Linear(4, 2)
        self.episode_log_probs = []
        self.episode_raw_rewards = []

    def forward(self, x):
        x = self.input(x)
        x = F.relu(x)
        x = self.output(x)
        x = F.softmax(x, dim=1)
        return x
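Before sending anything to bob, the policy can be sanity-checked locally. A small sketch (the names test_policy and dummy_state are only illustrative) showing that a batch of CartPole observations maps to a probability distribution over the two actions:
test_policy = Policy()
dummy_state = torch.zeros(1, 4)       # one CartPole observation with 4 features
probs = test_policy(dummy_state)      # shape (1, 2); each row sums to 1 because of the softmax
print(probs, probs.sum(dim=1))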
In [5]:
policy = Policy()
optimizer = optim.SGD(params=policy.parameters(), lr=0.03)
# discount rate used to compute the discounted action scores
discount_rate = 0.95
In [6]:
def select_action(state):
    state = torch.from_numpy(state).float().unsqueeze(0)
    # send the environment state to bob
    state = state.send(bob)
    probs = policy(state)
    # bring the estimated probabilities back locally to sample the action,
    # since Categorical does not yet support remote tensor operations
    probs = probs.get()
    m = Categorical(probs)
    action = m.sample()
    policy.episode_log_probs.append(m.log_prob(action))
    # retrieve the state, as the next state will be sent to bob on the following step
    state.get()
    return action.item()

def discount_and_normalize_rewards():
    discounted_rewards = []
    cumulative_rewards = 0
    for reward in policy.episode_raw_rewards[::-1]:
        cumulative_rewards = reward + discount_rate * cumulative_rewards
        discounted_rewards.insert(0, cumulative_rewards)
    discounted_rewards = torch.tensor(discounted_rewards)
    discounted_rewards = (discounted_rewards - discounted_rewards.mean()) / discounted_rewards.std()
    return discounted_rewards

def update_policy():
    policy_loss = []
    discounted_rewards = discount_and_normalize_rewards()
    for log_prob, action_score in zip(policy.episode_log_probs, discounted_rewards):
        policy_loss.append(-log_prob * action_score)
    optimizer.zero_grad()
    policy_loss = torch.cat(policy_loss).sum()
    policy_loss.backward()
    optimizer.step()
    # clear the per-episode buffers
    del policy.episode_log_probs[:]
    del policy.episode_raw_rewards[:]
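The discounting loop implements the recurrence G_t = reward_t + discount_rate * G_(t+1), computed backwards over the episode. A small worked example, assuming a hypothetical 3-step episode with a reward of 1 at every step and discount_rate = 0.95:
# illustrative check only; it temporarily fills and then clears the episode buffer
policy.episode_raw_rewards = [1.0, 1.0, 1.0]
returns = discount_and_normalize_rewards()
# the raw discounted returns are [1 + 0.95*1.95, 1 + 0.95*1.0, 1.0] = [2.8525, 1.95, 1.0];
# the function then subtracts their mean and divides by their standard deviation
print(returns)
policy.episode_raw_rewards = []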
In [7]:
total_rewards = []
# send the policy to bob for training
policy.send(bob)
for episode in range(500):
    state = env.reset()
    episode_rewards = 0
    for step in range(1000):
        action = select_action(state)
        state, reward, done, _ = env.step(action)
        # env.render()  # uncomment to render the current environment
        policy.episode_raw_rewards.append(reward)
        episode_rewards += reward
        if done:
            break
    # keep track of rewards earned in each episode
    total_rewards.append(episode_rewards)
    update_policy()
# cleanup
policy.get()
bob.clear_objects()
print('Average reward: {:.2f}\tMax reward: {:.2f}'.format(np.mean(total_rewards), np.max(total_rewards)))
Our agent managed to keep the pole upright for a maximum of 83 consecutive steps using a very simple neural network policy trained via federated learning with PySyft.
In the select_action function we have to bring the estimated probabilities back to our local worker in order to sample the action, since Categorical does not yet support remote tensor operations.
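To see whether the policy actually improves over the 500 episodes, one can also plot the per-episode rewards collected in total_rewards; a minimal sketch, assuming matplotlib is installed:
import matplotlib.pyplot as plt

plt.plot(total_rewards)
plt.xlabel('Episode')
plt.ylabel('Total reward')
plt.title('Reward per episode')
plt.show()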
Congratulations on completing this notebook tutorial! If you enjoyed this and would like to join the movement toward privacy preserving, decentralized ownership of AI and the AI supply chain (data), you can do so in the following ways!
The easiest way to help our community is just by starring the repositories! This helps raise awareness of the cool tools we're building.
We have put together really nice tutorials that give a better understanding of what federated and privacy-preserving learning should look like and how we are building the bricks for this to happen.
The best way to keep up to date on the latest advancements is to join our community!
The best way to contribute to our community is to become a code contributor! If you want to start "one off" mini-projects, you can go to the PySyft GitHub Issues page and search for issues marked Good First Issue.
If you don't have time to contribute to our codebase, but would still like to lend support, you can also become a Backer on our Open Collective. All donations go toward our web hosting and other community expenses such as hackathons and meetups!